In [156]:
%matplotlib inline
import matplotlib.pyplot as plt
import sqlite3
import pandas as pd
import seaborn as sns
sns.set_style("white")
In [79]:
ls -l ../data/output | grep -v "M.*\.csv"
In [57]:
conn = sqlite3.connect('../data/output/database.sqlite')
c = conn.cursor()
def execute(sql):
    '''
    Executes a SQL command on the 'c' cursor and returns the results
    '''
    c.execute(sql)
    return c.fetchall()
Apparently, SQLite is a serverless database (pretty cool, didn't know), so all I need to do is give it a file path and I can start making queries. I created an execute helper to save a couple of keystrokes and to make things easier if I ever change the cursor.
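As an alternative to the cursor wrapper, pandas can read query results straight into a DataFrame. A minimal sketch against an in-memory stand-in database (the rows here are made up for illustration; the real connection would use the sqlite file path above):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for '../data/output/database.sqlite'
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE Scorecard (id INTEGER, Year INTEGER)")
conn.executemany("INSERT INTO Scorecard VALUES (?, ?)",
                 [(1, 2012), (2, 2013), (3, 2013)])

# read_sql_query returns a DataFrame directly, no cursor bookkeeping needed
counts = pd.read_sql_query(
    "SELECT Year, Count(id) AS n FROM Scorecard GROUP BY Year ORDER BY Year",
    conn)
print(counts)
```

This avoids the tuple-of-tuples handling entirely, at the cost of pulling everything into memory at once.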
In [157]:
def printByYear(data):
    '''
    Given a list of tuples with (year, data), prints the data next to the corresponding year
    '''
    for datum in data:
        print "{0}: {1}".format(datum[0], datum[1])
In [58]:
# Print all tables in the database
tables = execute("SELECT Name FROM sqlite_master WHERE type='table'")
for table in tables:
    print table
There is one table and it is called Scorecard.
In [82]:
# Print the number of rows in the database
rowCount = execute("SELECT Count(id) RowCount from Scorecard")
rowCount = rowCount[0][0]
print "Row count:", rowCount
In [158]:
# Number of institutions in the dataset per year
rowCountByYear = execute("""SELECT Year, Count(id)
FROM Scorecard
GROUP BY Year""")
printByYear(rowCountByYear)
In [60]:
# Print the number of columns
fields = execute("PRAGMA table_info(Scorecard)")
print len(fields)
There are too many columns to analyze the data composition from the above query alone. To understand what is included in the dataset, I am looking through FullDataDocumentation.pdf, which covers the different categories of data and summarizes what is included in each. Below is a table of the items I found interesting.
All the fields in the database are listed in CollegeScorecardDataDictionary-09-12-2015.pdf along with their field names.
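With this many columns, scanning the PRAGMA output by eye is impractical; filtering the column names by keyword narrows things down. A sketch against a tiny stand-in table (the field names such as SAT_AVG are taken from the data dictionary and are assumptions about this sqlite export):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the Scorecard database
# A handful of columns standing in for the full schema
conn.execute("CREATE TABLE Scorecard "
             "(UNITID INTEGER, INSTNM TEXT, SAT_AVG REAL, ADM_RATE REAL)")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
cols = [row[1] for row in conn.execute("PRAGMA table_info(Scorecard)")]
sat_cols = [c for c in cols if 'SAT' in c.upper()]
print(sat_cols)
```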
Category | Field | Details
---|---|---
School | |
 | Name | Name of the school
 | Location | Longitude, latitude, city, etc.
 | Main campus or branch | 1 for main, 0 for branch
 | Type | Public, private for-profit, private non-profit
 | Revenue | Net tuition, instructional expenses, average faculty salary
 | Currently operating | 1 for yes, 0 for no
Academics | |
 | Programs offered | Field-of-study IDs
Admission | |
 | Admission rate for undergraduates | Rates for each branch or all branches
 | SAT and ACT scores |
Costs | |
 | Lots | See pages 6-7 of the data documentation
Student Body | |
 | Number of degree seekers |
 | Race statistics | Self-reported on FAFSA
 | Undergrads by family income | Percentages broken into categories based on expected family contribution on FAFSA
 | Retention rate |
 | Student body age | Percentage of students who are age 25-64
 | Parent education | Percentage of students whose parents have some level of education; collected from FAFSA
Financial Aid | |
 | Cumulative median debt |
 | Percent receiving Pell Grants | Federal grant that does not need to be repaid
 | Percent receiving federal loans |
Completion | |
 | Completion rate | Percent of students graduating within 150% and 200% of the expected time to graduate
 | Title IV students |
Earnings | |
 | Mean and median income | Available 6 and 10 years after graduation; data comes from W-2 tax forms
 | Threshold earnings | Percent of students earning more than people age 25-34 with only a high school diploma; a measure of whether the degree financially improved the student's outcome relative to not attending college
Repayment | |
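Once a few fields of interest are picked out of the table above, they can be pulled together in one query. A hedged sketch against an in-memory stand-in (the schema and values here are invented; the real column names would come from the data dictionary):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the Scorecard database
conn.execute("CREATE TABLE Scorecard (INSTNM TEXT, Year INTEGER, ADM_RATE REAL)")
conn.executemany("INSERT INTO Scorecard VALUES (?, ?, ?)",
                 [('A University', 2013, 0.9),
                  ('B College', 2013, 0.1),
                  ('C Institute', 2013, None)])  # NULLs are common in this data

# Most selective institutions first; NULL admission rates filtered out
rows = conn.execute("""SELECT INSTNM, ADM_RATE
                       FROM Scorecard
                       WHERE Year=2013 AND ADM_RATE IS NOT NULL
                       ORDER BY ADM_RATE""").fetchall()
print(rows)
```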
In [167]:
# Get locations of the universities
coordinates = execute("""SELECT Latitude, Longitude
FROM Scorecard
WHERE Latitude IS NOT NULL
AND Year=2013
AND main='Main campus'""")
In [168]:
print "Percent with location data (2013): {0: .2f}%".format(((len(coordinates)*1.0) / rowCountByYear[14][1]) * 100)
# Plot locations of institutions
df = pd.DataFrame()
def checkCoordinates(x, y):
    '''
    Keep only points inside a rough bounding box around the contiguous US
    '''
    return -128.19 <= x <= -65 and 24.19 <= y <= 49.62
df['x'] = [row[1] for row in coordinates if checkCoordinates(row[1], row[0])]
df['y'] = [row[0] for row in coordinates if checkCoordinates(row[1], row[0])]
locations = sns.regplot('x', 'y',
                        data=df,
                        fit_reg=False)
locations.set(title="Locations of Institutions", xticks=[], yticks=[], xlabel="", ylabel="")
sns.despine(left=True, bottom=True)
In [161]:
# Number of institutions that are main campuses by year
mainCampuses = execute("""SELECT Year, Count(id) mainCount
FROM Scorecard
WHERE main='Main campus'
GROUP BY Year""")
In [166]:
print "Number of total main campuses: {0}".format(sum([count[1] for count in mainCampuses]))
printByYear(mainCampuses)
print "\nMain campus percentages"
for i, count in enumerate(mainCampuses):
    print "{0}: {1: .2f}%".format(count[0], ((count[1]*1.0)/rowCountByYear[i][1])*100)
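Indexing `rowCountByYear[i]` by position assumes both result sets list the same years in the same order; joining on the year itself is safer. A small sketch with made-up counts standing in for the two query results:

```python
# Stand-ins for dict(rowCountByYear) and dict(mainCampuses)
totals = {2012: 100, 2013: 120}
mains = {2012: 60, 2013: 84}

# Percentage of main campuses per year, keyed by year instead of list position
pcts = {year: 100.0 * mains[year] / totals[year] for year in mains}
for year in sorted(pcts):
    print("{0}: {1:.2f}%".format(year, pcts[year]))
```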
I think a visualization with a picture of all the universities in the US would be cool. Maybe as the user adds certain constraints, they could see the dots become highlighted and unhighlighted. Along with that, there would be a bar on the right with the most prominent universities in those categories. It could also allow them to pin certain universities or institutions they were interested in.
As for a direction to explore, I kind of want to explore everything, or at least let the user explore everything.
In [ ]: